Parsing Arabic Using Treebank-based LFG Resources

Authors

  • Lamia Tounsi
  • Mohammed Attia
  • Josef van Genabith
  • Miriam Butt
  • Tracy Holloway
Abstract

In this paper we present initial results on parsing Arabic using treebank-based parsers and automatic LFG f-structure annotation methodologies. The Arabic Annotation Algorithm (A) (Tounsi et al., 2009) exploits the rich functional annotations in the Penn Arabic Treebank (ATB) (Bies and Maamouri, 2003; Maamouri and Bies, 2004) to assign LFG f-structure equations to trees. For parsing, we modify Bikel’s (2004) parser to learn ATB functional tags and merge phrasal categories with functional tags in the training data. Functional tags in parser output trees are then “unmasked” and made available to A to assign f-structure equations. We evaluate the resulting f-structures against the DCU250 Arabic gold standard dependency bank (Al-Raheb et al., 2006). Currently we achieve a dependency f-score of 77%.

1 Related Work

Arabic parsing systems have been reported in (Ditters, 2001; Zabokrtsky and Smrz, 2003; Othman et al., 2003; Ramsay and Mansour, 2007). Attia (2008) gives an overview of an LFG rule-based analysis of Arabic using XLE (Xerox Linguistics Environment). He concentrated on short sentences and used robustness techniques to increase coverage. All of these systems use hand-crafted grammars, which are time-consuming to produce and difficult to scale to unrestricted data.

More recently, the Penn Arabic Treebank (ATB) has been employed to acquire wide-coverage parsing resources. The best-known Arabic statistical parser was developed by Bikel (Bikel, 2004). Bikel reports parse quality “far below” that achieved for English and Chinese (Kulick et al., 2006). The main reasons cited were a significant number of POS-tag inconsistencies (in the version of the ATB available at the time) and the considerable differences between Arabic and English sentence structure. Dieb et al. (2004) and Habash and Rambow (2005) present knowledge- and machine-learning-based methods for tokenisation, basic POS tagging with a reduced tagset, and base phrase chunking.

Bikel’s parser produces phrase-structure trees (c-structures). The main objective of our research is to automatically enrich the output of Bikel’s parser with more abstract, “deep” dependency information (in the form of LFG f-structures), using the Arabic A annotation algorithm (Tounsi et al., 2009), which extends the approach of Cahill et al. (2004), originally developed for English.

2 The Penn Arabic Treebank (ATB)

Arabic is a subject pro-drop language. It has relatively free word order: mainly S(ubject) V(erb) O(bject), with VSO and VOS also possible. Arabic is a highly inflectional and cliticizing language. The ATB consists of 23,611 parse-annotated sentences (Bies and Maamouri, 2003; Maamouri and Bies, 2004) from Arabic newswire text in Modern Standard Arabic (MSA). The ATB annotation scheme involves 497 different POS-tags with morphological information (reduced to 24 basic POS-tags by Bikel, e.g. NN, NNS, JJ), 22 phrasal tags (e.g. NP, VP, PP) and 20 functional tags (e.g. SBJ, OBJ, TPC), which yield 52 combined functional tags because functional tags can stack.

3 The Arabic Annotation Algorithm (A)

The Arabic Annotation Algorithm (Tounsi et al., 2009) is constructed by adapting and revising the methodology of Cahill et al. (2004) for English as follows:

  1. Automatic extraction of the most frequent rule types from the treebank.
  2. Head lexicalisation of ATB trees to identify local heads.
  3. Default f-structure equations are assigned to ATB functional tags.

In addition, lexical macros exploit the rich morphological information provided by the ATB. With 85% token coverage.
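The abstract describes merging phrasal categories with functional tags in the training data and later “unmasking” them in the parser output. The excerpt does not say how this is implemented, so the following is only a minimal sketch under stated assumptions: trees are Penn-style bracketed strings, a merged label is formed by replacing the hyphen with a hypothetical mask character (`_`) so that a label such as NP-SBJ is treated as one atomic category, and no label already contains that character. The function names and the toy tree are illustrative and are not taken from the paper or from Bikel’s parser.

```python
import re

# Matches a node label directly after an opening bracket, e.g. "NP-SBJ" in "(NP-SBJ ...".
LABEL = re.compile(r"\(([^\s()]+)")

# Hypothetical mask character (an assumption, not from the paper).
# Assumes no treebank label already contains it.
MASK = "_"


def mask_tree(tree: str) -> str:
    """Fuse category and functional tags into one symbol: NP-SBJ -> NP_SBJ."""
    return LABEL.sub(lambda m: "(" + m.group(1).replace("-", MASK), tree)


def unmask_tree(tree: str) -> str:
    """Restore the hyphenated labels in parser output: NP_SBJ -> NP-SBJ."""
    return LABEL.sub(lambda m: "(" + m.group(1).replace(MASK, "-"), tree)


def split_label(label: str) -> tuple[str, list[str]]:
    """Split an unmasked label into (category, functional tags), dropping numeric indices."""
    category, *rest = label.split("-")
    return category, [tag for tag in rest if not tag.isdigit()]


# Toy tree with transliterated placeholder words (illustrative only).
tree = "(S (NP-SBJ (NN walad)) (VP (VBD kataba) (NP-OBJ (NN risala))))"
masked = mask_tree(tree)       # what a parser would be trained on in this sketch
restored = unmask_tree(masked)  # what an annotation step would read
```

In this sketch, a parser trained on the masked trees sees NP_SBJ and NP_OBJ as distinct categories, and after unmasking, `split_label` exposes the functional tags that an annotation step could map to default f-structure equations.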


Related resources

DCU 250 Arabic Dependency Bank: An LFG Gold Standard Resource for the Arabic Penn Treebank

This paper describes the construction of a dependency bank gold standard for Arabic, DCU 250 Arabic Dependency Bank (DCU 250), based on the Arabic Penn Treebank Corpus (ATB) (Bies and Maamouri, 2003; Maamouri and Bies, 2004) within the theoretical framework of Lexical Functional Grammar (LFG). For parsing and automatically extracting grammatical and lexical resources from treebanks, it is neces...


Automatic Extraction and Evaluation of Arabic LFG Resources

This paper presents the results of an approach to automatically acquire large-scale, probabilistic Lexical-Functional Grammar (LFG) resources for Arabic from the Penn Arabic Treebank (ATB). Our starting point is the earlier work of (Tounsi et al., 2009) on automatic LFG f(eature)-structure annotation for Arabic using the ATB. They exploit tree configuration, POS categories, functional tags, lo...


Treebank-Based Acquisition of Chinese LFG Resources for Parsing and Generation

This thesis describes a treebank-based approach to automatically acquire robust, wide-coverage Lexical-Functional Grammar (LFG) resources for Chinese parsing and generation, which is part of a larger project on the rapid construction of deep, large-scale, constraint-based, multilingual grammatical resources. I present an application-oriented LFG analysis for Chinese core linguistic phenomena an...


Treebank-Based Acquisition of LFG Resources for Chinese

This paper presents a method to automatically acquire wide-coverage, robust, probabilistic Lexical-Functional Grammar resources for Chinese from the Penn Chinese Treebank (CTB). Our starting point is the earlier, proof-of-concept work of (Burke et al., 2004) on automatic f-structure annotation, LFG grammar acquisition and parsing for Chinese using the CTB version 2 (CTB2). We substantially exten...


Arabic Parsing Using Grammar Transforms

We investigate Arabic Context Free Grammar parsing with dependency annotation, comparing lexicalised and unlexicalised parsers. We study how morphosyntactic as well as function tag information percolation in the form of grammar transforms (Johnson, 1998; Kulick et al., 2006) affects the performance of a parser and helps dependency assignment. We focus on the three most frequent functional tags i...



Publication date: 2009